Real-time jitter control and packet-loss concealment in an audio signal

ABSTRACT

An “adaptive audio playback controller” operates by decoding and reading received packets of an audio signal into a signal buffer. Samples of the decoded audio signal are then played out of the signal buffer according to the needs of a player device. Jitter control and packet loss concealment are accomplished by continuously analyzing buffer content in real-time and determining, as a function of that content, whether to provide unmodified playback from the buffer contents, to compress or stretch buffer content, or to provide packet loss concealment for overly delayed or lost packets. Further, the adaptive audio playback controller also determines where to stretch or compress particular frames or signal segments in the signal buffer, and how much to stretch or compress such segments in order to optimize perceived playback quality.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a Divisional Application of U.S. patent application Ser. No. 10/660,326, filed on Sep. 10, 2003, by Florencio, et al., and entitled “A SYSTEM AND METHOD FOR REAL-TIME JITTER CONTROL AND PACKET-LOSS CONCEALMENT IN AN AUDIO SIGNAL,” and claims the benefit of that prior application under Title 35, U.S. Code, Section 120.

BACKGROUND

1. Technical Field

The invention is related to receipt and playback of packet-based audio signals, and in particular, to a system and method for providing automatic jitter control and packet loss concealment for audio signals broadcast across a packet-based network or communications channel.

2. Related Art

Conventional packet communication systems, such as the Internet or other broadcast networks, are typically lossy. In other words, not every transmitted packet can be guaranteed to be delivered either error free, on time, or even in the correct sequence. Further, any delay in delivery time is usually variable. If the receiver can wait for packets to be retransmitted, correctly ordered, or corrected using some type of error correction scheme, then the fact that such networks are inherently lossy and delay prone is not an issue. However, for near real-time applications, such as, for example, voice-based communications systems across such packet-based networks, the receiver cannot wait for packets to be retransmitted, correctly ordered, or corrected without causing undue, and noticeable, lag or delay in the communication.

Many conventional schemes address minor delays in packet delivery time by simply providing a temporary buffer of received packets in combination with a delayed playback of the received packets. Such schemes are typically referred to as “jitter control” schemes. In general, most such schemes address delay in packet receipt by using a “jitter buffer” or the like which temporarily stores incoming packets or signal frames and provides them to a decoder with sufficient delay that one or more subsequent packets should have already been received. In other words, the jitter buffer simply keeps one or more packets in a buffer for delaying playback of the incoming signal for a period long enough to ensure that a majority of packets are actually received before they need to be played.

A sufficient increase in the length of the buffer allows virtually all packets to be received before they need to be played back. In fact, if the size of the jitter buffer is at least as long as the difference between the smallest and largest possible packet delays, then all packets could be played without any apparent gap or delay between packets. Unfortunately, as the length of the buffer increases, playback of the signal increasingly lags real-time. In a one-way audio signal, such as a music broadcast, for example, this is typically not a problem. However, in systems such as real-time or two-way conversations, temporal lag resulting from the use of such buffers becomes increasingly apparent, and undesirable, as the buffer length increases.

In addition, the basic idea of using a buffer has been improved in many modern communications systems by using compression and stretching techniques for providing temporal adjustment of the playback duration of signal frames. As a result, the jitter buffer length can be adapted during speech utterances by stretching or compressing the currently playing audio signal, as necessary, for reducing the average delay without incurring as many late losses. Unfortunately, the use of temporal stretching and compression techniques for frames in an audio signal often results in audible artifacts which may be objectionable to the human listener.

An additional conventional technique, commonly referred to as “packet loss concealment,” has also been used to improve the perceived speech quality. For example, as noted above, packet loss may occur when overly delayed packets are not received in time for playback. Typically, such overly delayed packets are referred to as “late loss” packets. Similarly, packet loss may also occur simply because the packet was never received. Conventional packet loss concealment schemes typically address such overly delayed and lost packets in the same manner by using some sort of packet loss concealment technique.

Further, many such schemes provide a combination of both jitter control and packet loss concealment. With respect to jitter control, most schemes determine the size of the jitter buffer by determining a minimum buffer size as a compromise between late or actual loss and packet delay. Further, a number of conventional schemes offer some sort of network analysis for further optimizing buffer size for minimizing delay and maximizing timely packet receipt. Packets that are determined to be late loss packets are typically handled in the same way as if they were actually lost. In fact, actually lost packets are typically declared to be a late loss anyway, as whatever delay criterion is used for determining a late loss will also be met by an actually lost packet. In either case, conventional decoders implement some sort of error concealment to hide the fact that the packet that should be played has not been received.

One conventional scheme uses both jitter control and packet loss concealment. In general, this scheme minimizes the length of the jitter buffer by allowing each packet to be stretched and/or compressed, as needed to account for delayed packet receipt while still maintaining one or more packets in the jitter buffer. In particular, this scheme first introduces a one-packet delay, in order to wait for a packet to be either received, or declared lost, before deciding on whether the packet to be played currently should be stretched or compressed. Further, this scheme analyzes network performance on an ongoing basis to determine whether packets scheduled to be played in the near future are likely to be received on time. Received packets are then stretched or compressed, as necessary, to ensure that the buffer is not empty before the next scheduled packet arrival time.

However, when a packet does not arrive by the scheduled time, it is declared to be a late loss, and error concealment is then used to hide that loss. Most modern schemes use some form of stretching and compression in combination with a windowing technique for merging boundaries of packets bordering missing packets declared to be late loss packets. In general, such schemes typically operate by decomposing input packets into overlapping segments of equal length. These overlapping segments are then realigned and superimposed via a conventional correlation process along with smoothing of the overlap regions to form an output segment having a degree of overlap which results in the desired output length. The result is that the composite segment is useful for hiding or concealing perceived packet delay or loss. Unfortunately, such schemes typically make packet-based decisions regarding whether a packet is to be declared as a late loss. Consequently, such schemes often declare packets to be a late loss when they are actually received in sufficient time that they could have been played as a part of the signal playback.

Therefore, what is needed is a system and method that provides for both jitter control and packet loss concealment. This scheme should minimize buffer length, and thus delay, while also minimizing any artifacts resulting from either stretching or compression of audio segments. Further, rather than using a simple packet-based determination for deciding late loss for particular packets, the decision should be made as a function of buffer content for reducing overall buffer size and delay.

SUMMARY

Jitter control and packet loss concealment are two well-known techniques for improving the quality of signals transmitted across lossy and delay prone packet-based networks such as the Internet and other conventional voice-based communications channels. Clearly, signal quality and system performance improve as a function of both reduced delay and reduced signal artifacts. Thus, to address the need for high quality audio jitter control and packet loss concealment, an “adaptive audio playback controller” is provided for performing automatic buffer-based adaptive jitter control and packet loss concealment for audio signals transmitted across a packet-based network as a function of buffer content. Further, the de-jittering and packet loss concealment processes described herein are compatible with most conventional codecs for decoding and providing a playback of audio signals.

In general, the adaptive audio playback controller operates by first using a conventional codec for decoding and reading transmitted signal frames into a signal buffer as soon as those frames have been received. Samples of the decoded audio signal are then played out of the buffer according to the needs of the player device. Note that the size of the input frame read into the buffer and the size of the output frame (i.e., the sample output to the player device) do not need to be the same. Input frame size is determined by the codec, and some codecs use larger frame sizes to save on bitrate. Output frame size is determined by the buffering system on the playout or playback device. For example, in a tested embodiment, a 10 ms output frame was used in combination with a 20 ms input frame. However, rather than simply playing back the frames, the adaptive audio playback controller stretches or compresses the content of the buffer, as necessary, to perform real-time jitter control and packet loss concealment as a function of buffer content rather than a function of expected packet receipt time as with conventional schemes.
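
The decoupled read/write granularity can be pictured as a simple sample queue. The following Python fragment is illustrative only and is not drawn from the patent itself; the `SampleBuffer` class, its method names, and the 8 kHz sample rate are assumptions, with the 10 ms output frame size taken from the tested embodiment described above.

```python
from collections import deque

SAMPLE_RATE = 8000       # assumed narrowband sample rate
OUTPUT_FRAME_MS = 10     # playout frame size from the tested embodiment

class SampleBuffer:
    """Queue of decoded samples; writes (codec frames, e.g., 20 ms) and
    reads (playout frames, e.g., 10 ms) need not use the same size."""

    def __init__(self):
        self.samples = deque()

    def write_decoded_frame(self, frame):
        # Called as soon as a packet is received and decoded.
        self.samples.extend(frame)

    def read_output_frame(self):
        # Called on the player device's schedule; returns whatever is
        # available if the buffer is running low.
        n = SAMPLE_RATE * OUTPUT_FRAME_MS // 1000
        return [self.samples.popleft()
                for _ in range(min(n, len(self.samples)))]

    def duration_ms(self):
        return 1000 * len(self.samples) // SAMPLE_RATE
```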

Primary components of the de-jittering processes include buffer analysis, adaptive signal stretching processes, and adaptive signal compression processes. These processes operate based on a maximum and minimum buffer size. In a tested embodiment, a 10 ms minimum buffer size was used to guarantee enough data is present in the buffer to allow for good quality stretching. In contrast, the maximum buffer size is designed as a tradeoff between minimizing the probability that any given sample will need to be stretched, and the delay resulting from increased buffer size. For example, in one embodiment, maximum buffer size was determined by performing a conventional statistical modeling of the broadcast channel or network, and setting the maximum buffer size at a level that will guarantee receipt of at least a minimum threshold number of data packets, such as, for example, 95% of the packets, before those packets are needed for playback. Methods for performing such statistical modeling of packet receipt across a network channel are well known to those skilled in the art, and will not be described in detail herein.
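
One plausible realization of this sizing step is sketched below, assuming an empirical percentile over a window of recent one-way delay measurements stands in for the “conventional statistical modeling”; the 95% figure comes from the text, while the function and its inputs are assumptions.

```python
def max_buffer_ms(recent_delays_ms, target=0.95):
    """Return a maximum buffer size (in ms) large enough to cover
    `target` of the observed one-way packet delays."""
    delays = sorted(recent_delays_ms)
    idx = min(int(target * len(delays)), len(delays) - 1)
    return delays[idx]

# Example: with measured one-way delays between 5 ms and 60 ms,
# max_buffer_ms([5, 8, 12, 15, 18, 25, 30, 41, 55, 60]) returns 60.
```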

As noted above, one of the components of the adaptive audio playback controller involves a signal stretching process. Conventional signal stretching schemes typically stretch a received frame of the audio signal until the time scheduled for arrival of the next packet. However, these schemes will declare a packet as a “late loss” when it is not received within a certain predetermined period of time. For example, such schemes typically set a time limit for receiving a packet n that expires soon after the time a prior packet, i.e., packet n−1, was received. If packet n is not received by that predetermined time, a late loss is declared, and “loss concealment” techniques are then used for concealing that loss. Thus, such schemes are packet-based.

In contrast, the adaptive audio playback controller described herein operates as a function of buffer content rather than packet receipt time. For example, unlike conventional stretching schemes, the audio playback controller begins stretching the contents of the buffer whenever a particular packet, e.g., packet n, has not arrived by the expected or scheduled time. In this case, the signal existing in the buffer is stretched until the delayed packet arrives, or until it is eventually declared “lost.” This differs from conventional stretching processes in that rather than immediately declaring a packet as a “late loss” when it is not received within a predetermined period of time, the contents of the buffer are stretched while simultaneously determining an appropriate time limit for declaring that packet to be a late loss as a function of the current buffer contents. Furthermore, the receipt of a subsequent packet (e.g., packet n+1, where packet n represents the expected packet) will change this time limit. Consequently, the adaptive audio playback controller provides a significantly increased packet receipt time prior to declaring a late loss for any given packet. As a result, packet “late loss” is significantly reduced, thereby resulting in a significantly reduced use of packet loss concealment processes for reducing artifacts in the signal, and a perceptibly cleaner signal playback. Further, and more importantly, the increased packet receipt time does not come at the cost of increased signal delay.

In particular, rather than setting a time limit for declaring packet loss, the adaptive audio playback controller simply waits for the next packet to be received, or until one of several “loss conditions” is satisfied, as described below. For example, one such loss condition is to set a maximum delay time for packet receipt. Given a sufficiently long delay time T, late loss will only be declared in relatively extreme delay cases, when a signal connection was lost, or when a talk spurt has ended in the case where no information is sent about the end of the talk spurt. In a tested embodiment, values for the delay time T on the order of about 20 ms to about 1 sec were used, with values of T around 100 ms typically providing good results.

A second loss condition relates to receiving a subsequent packet prior to receiving the next expected packet in the transmission. Typically, this results from either packet inversion or actual packet loss. As noted above, conventional schemes will generally ignore packet arrival order, and wait the maximum amount of time regardless of whether a subsequent packet has been received or not. In contrast, the adaptive audio playback controller reduces the time required to declare a late loss whenever a subsequent packet is received prior to receiving the expected packet. However, to minimize any declarations of “late loss” due to packet inversion, the adaptive audio playback controller still waits for some time before declaring a loss, even if a subsequent packet has already been received. On the other hand, since packet inversions are rare, the waiting is kept to a minimum, in order to avoid introducing additional artifacts in the signal. More specifically, the signal in the buffer will not be stretched beyond the period that the buffered signal would be stretched in the case where a packet loss would be declared, as noted above. Once that time has been reached, packet n is declared as lost, and the packet loss concealment processes described herein are used to reduce or eliminate artifacts in the signal.

As noted above, signal stretching is used to compensate for delayed or lost packets. On the other hand, signal compression is used to address the case where the signal buffer has become too full, with a resulting increase in signal delay. Therefore, by compressing the signal contained in the buffer, playback time of the buffered signal is reduced, the buffer is at least partially emptied, and the signal playback delay is reduced. As described herein, when compressing the signal, it is typically a good idea to wait for a segment of speech where compression is expected to produce little or no artifacts, rather than simply compress the next segment to be played out. One simple solution is to compress only in between talk spurts. However, a better process considers how much compression is desired (i.e., how far behind in time signal playback is), and how easy it is to compress a particular segment while minimizing artifacts. Further, the need to compress the buffer implies that a long signal segment is in the buffer, and that therefore there is some freedom on where to compress the signal.

The selection of which segments to actually compress in any given frame or frames is an important decision, as it typically affects the perceived quality of the reconstructed signal for a human listener. For example, rather than compress all segments of a given frame equally, better results are typically achieved by employing a hierarchical or layered approach to compression. In particular, in an audio signal including speech, each segment of a frame will be either a “voiced” segment, which is dominated by quasi-periodic speech, an “unvoiced” segment, dominated by aperiodic speech or other signals, or a “mixed” segment which includes both periodic and aperiodic components. Given the determination of segment type in the buffer, the desired compression is achieved in any given frame or frames by first compressing particular segment types in a preferential hierarchical order.

For example, compressing segments that represent voiced speech, silence, or simple noise, while avoiding compression of unvoiced segments or transients, produces a reconstructed signal having fewer perceivable artifacts. If sufficient compression cannot be accomplished by compressing segments representing voiced speech, silence, or simple noise, then non-transitional unvoiced segments are compressed in the manner described above. Finally, segments including transitions are compressed if sufficient compression cannot be achieved through compression of the voiced segments or non-transitional unvoiced segments. This hierarchical approach to compression serves to limit perceivable artifacts in the reconstructed signal.
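
The hierarchy just described might be coded as a simple priority scan over classified buffer segments. This is a sketch, not the patented procedure; the segment representation and the per-segment budgets are invented for illustration.

```python
# Preference order from the text: first voiced speech, silence, or simple
# noise; then non-transitional unvoiced segments; transitions last.
COMPRESSION_ORDER = ["voiced_or_silence", "unvoiced", "transition"]

def plan_compression(segments, samples_needed):
    """`segments` is a list of (segment_type, removable_samples) pairs.
    Returns (segment_index, samples_to_remove) pairs, easiest types first."""
    plan, remaining = [], samples_needed
    for seg_type in COMPRESSION_ORDER:
        for i, (stype, budget) in enumerate(segments):
            if stype == seg_type and remaining > 0:
                take = min(budget, remaining)
                plan.append((i, take))
                remaining -= take
        if remaining <= 0:
            break
    return plan
```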

In view of the above summary, it is clear that the adaptive audio playback controller provides a unique system and method for providing buffer-based jitter control and packet loss concealment via adaptive stretching and compression of frames of a received audio signal while minimizing perceivable artifacts in a reconstruction of that signal. In addition to the just described benefits, other advantages of the system and method for providing buffer-based jitter control and packet loss concealment for a received audio signal will become apparent from the detailed description which follows hereinafter when taken in conjunction with the accompanying drawing figures.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 is a general system diagram depicting a general-purpose computing device constituting an exemplary system for providing adaptive buffer-based jitter control and packet loss concealment for playback of an audio signal.

FIG. 2 illustrates an exemplary architectural diagram showing exemplary program modules for adaptive buffer-based jitter control and packet loss concealment for playback of an audio signal.

FIG. 3 illustrates an exemplary system flow diagram for adaptive buffer-based jitter control and packet loss concealment for playback of an audio signal.

FIG. 4 illustrates an exemplary system flow diagram for determining when to declare packet late loss and implement packet loss concealment processes for playback of an audio signal.

FIG. 5 illustrates an exemplary system flow diagram for implementing packet loss concealment processes for playback of an audio signal following a determination of packet late loss.

FIG. 6 illustrates an exemplary system flow diagram for determining how much particular segments of a signal buffer should be stretched to compensate for packet delay for playback of an audio signal.

FIG. 7 illustrates an exemplary system flow diagram for adaptive buffer-based jitter control and packet loss concealment in a Linear Predictive Coding (LPC) residual domain rather than a signal domain for playback of an audio signal.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following description of the preferred embodiments of the present invention, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

1.0 Exemplary Operating Environment:

FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held, laptop or mobile computer or communications devices such as cell phones and PDA's, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. With reference to FIG. 1, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 110.

Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.

Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology; CD-ROM, digital versatile disks (DVD), or other optical disk storage; magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices; or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 1, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball, or touch pad.

In addition, the computer 110 may also include a speech input device, such as a microphone 198 or a microphone array, as well as a loudspeaker 197 or other sound output device connected via an audio interface 199. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, radio receiver, and a television or broadcast video receiver, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as, for example, a parallel port, game port, or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as a printer 196, which may be connected through an output peripheral interface 195.

The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

The exemplary operating environment having now been discussed, the remaining part of this description will be devoted to a discussion of the program modules and processes embodying an “adaptive audio playback controller” for performing automatic buffer-based adaptive jitter control and packet loss concealment for audio signals transmitted across a packet-based network as a function of buffer content.

2.0 Introduction:

Jitter control, or de-jittering, and packet loss concealment have been used for a number of years for improving the perceived playback quality of speech-based signals transmitted across lossy and delay prone packet-based networks such as the Internet or other communications networks. An adaptive audio playback controller, as described herein, provides for reduced signal delay time, improved jitter control, and improved packet loss concealment through use of a buffer-content based process for determining when and where particular frames or audio segments are to be stretched or compressed, and when to apply loss concealment techniques so as to minimize packet loss and artifacts resulting from such packet loss.

In general, the adaptive audio playback controller operates by first using a conventional codec for decoding and reading received packets into a signal buffer as soon as those packets have been received and decoded into signal frames. Samples of the decoded audio signal are then played out of the signal buffer according to the needs of the player device. Jitter control and packet loss concealment are accomplished by continuously analyzing buffer content in real-time and determining whether to provide unmodified playback from the buffer contents, to compress or stretch buffer content, or to provide packet loss concealment for overly delayed or lost packets. Further, in addition to automatically determining whether to provide straight playback or processed playback (compression, stretching, or packet loss concealment), the adaptive audio playback controller also determines where to stretch or compress particular frames or signal segments in the signal buffer, and how much to stretch or compress such segments in order to optimize perceived playback quality. The frames, either processed or unmodified, are then provided for immediate playback, as needed by a playback device.

2.1 System Overview:

The adaptive audio playback controller provides for automatic buffer-based adaptive jitter control and packet loss concealment for audio signals transmitted across a packet-based network as a function of buffer content. Primary components of the de-jittering processes include buffer analysis, adaptive signal stretching, and adaptive signal compression. These components operate based on a maximum and minimum buffer size. The minimum buffer size is determined by choosing a buffer size that will guarantee enough data is present in the buffer to allow for good quality stretching. In contrast, the maximum buffer size is designed as a tradeoff between minimizing the probability that any given sample will need to be stretched, and the delay naturally resulting from increased buffer size. Typically, this choice is made as a function of network performance characteristics such as loss rates and packet delay times.

As noted above, the primary components of the adaptive audio playback controller include a buffer analysis process. This buffer analysis process examines the content of the signal buffer for determining whether to provide unmodified playback from the buffer contents, to compress or stretch buffer content, or to provide packet loss concealment for overly delayed or lost packets.

The signal stretching processes described herein are used to increase the playback time of one or more signal segments as a way of providing additional time in which to receive delayed signal packets across the network. Unlike conventional signal stretching schemes which will declare a packet as a “late loss” when it is not received within a certain predetermined period of time, the adaptive audio playback controller operates as a function of buffer content rather than packet receipt time.

Therefore, unlike conventional stretching schemes, the audio playback controller begins stretching the contents of the buffer whenever a particular packet, e.g., packet n, has not arrived by the expected or scheduled time. In this case, the signal existing in the buffer is stretched until the delayed packet arrives, or until it is eventually declared as “lost” based on one or more predetermined loss conditions, as described below. This differs from conventional stretching processes in that rather than immediately declaring a packet as a “late loss” when it is not received within a predetermined period of time, the contents of the buffer as well as the amount of stretching already performed and the possible arrival of subsequent packets are all used to determine an appropriate time for declaring that packet to be a late loss.

In general, the stretching process is used to locate, create, or estimate samples that are inserted into the existing signal. These samples are then blended with the original signal content using a windowing process to hide or minimize any perceivable artifacts that would otherwise exist at the boundary points between the inserted samples and the original signal content. However, the type of windowing process used, and the methods for locating, creating, or estimating samples for stretching, are dependent upon the content type of the frames in the buffer, i.e., “voiced frames,” “unvoiced frames,” or “mixed frames.”
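
As a concrete illustration of the blending step, the sketch below splices a run of created samples into the buffered signal with a linear cross-fade at each boundary. The triangular window and the function itself are assumptions standing in for whatever content-dependent window the controller actually selects.

```python
import numpy as np

def blend_insert(signal, insert, pos, overlap):
    """Splice `insert` into `signal` at sample `pos`, cross-fading over
    `overlap` samples at each boundary (assumes pos >= overlap,
    len(signal) - pos >= overlap, and len(insert) >= 2 * overlap)."""
    head, tail = signal[:pos], signal[pos:]
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in

    mid = np.asarray(insert, dtype=float).copy()
    # Fade the end of the existing signal into the inserted samples...
    mid[:overlap] = head[-overlap:] * fade_out + mid[:overlap] * fade_in
    # ...and fade the end of the inserted samples into the continuation.
    out_tail = np.asarray(tail, dtype=float).copy()
    out_tail[:overlap] = mid[-overlap:] * fade_out + tail[:overlap] * fade_in
    return np.concatenate([head[:-overlap], mid[:-overlap], out_tail])
```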

For example, in an audio signal including speech, each segment of any particular frame will be either a “voiced” segment that includes quasi-periodic speech or some other quasi-periodic signal, an “unvoiced” segment which does not include any significant periodicity, or a “mixed” segment which includes both periodic and aperiodic components. Then, in order to achieve optimal results, stretching that is specifically targeted to the particular segment type, i.e., voiced, unvoiced, or mixed, is applied.
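
A minimal sketch of such a classifier follows, assuming the decision is made from the peak of the normalized autocorrelation over a plausible pitch-lag range; the thresholds and the pitch range below are illustrative choices, not values from the patent.

```python
import numpy as np

def classify_segment(x, sample_rate=8000, lo_hz=60, hi_hz=400,
                     voiced_thresh=0.7, mixed_thresh=0.4):
    """Label a segment 'voiced', 'mixed', or 'unvoiced'. Assumes the
    segment spans at least a few pitch periods."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if not x.any():
        return "unvoiced"                 # silence has no periodicity
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]                       # normalize so ac[0] == 1
    lo, hi = sample_rate // hi_hz, sample_rate // lo_hz
    peak = ac[lo:hi].max()                # search the pitch-lag range
    if peak >= voiced_thresh:
        return "voiced"
    if peak >= mixed_thresh:
        return "mixed"
    return "unvoiced"
```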

The packet loss concealment processes described herein work in cooperation with the signal stretching processes to address late loss of packets by attempting to hide such losses when necessary. In particular, once it is determined that a packet is lost, the system will no longer wait for that packet to be received. Loss concealment then takes the form of either a “mute mode,” or of a “loss concealment mode.” In particular, the mute mode is used to hide packet losses where a maximum delay time has been exceeded without receiving any packets. In contrast, the loss concealment mode is used to hide packet losses where the delay time has not been exceeded, but wherein the buffer has already been stretched and a subsequent packet has already been received.

In one embodiment, muting provided by the mute mode is implemented gradually so as to minimize audible artifacts in the signal. Further, in another embodiment, the signal is not entirely muted, but is instead reduced to a “comfort noise” level that is computed for simulating a noise level similar to any noise that was present when the connection was active, but when there was no speech. Consequently, signal loss is not readily apparent to the listener. This is important for maintaining apparent signal quality in lossy networks where the signal may be lost and reestablished a number of times during a typical communication session.

In general, the packet loss concealment mode operates by first determining the number of signal samples that need to be inserted between current buffer content and future buffer content. In other words, this computation determines the number of samples that need to be used to fill the hole caused by a packet loss existing between a current signal frame and a future signal frame that have already been received into the signal buffer. In one embodiment, given the computed number of samples, stretching is divided between the current and future buffer content as a function of the average energy of that buffer content, with lower energy signal frames being preferentially stretched over higher energy frames so as to minimize signal artifacts.
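
Read literally, the energy-weighted division might look like the following sketch, where the side with lower average energy absorbs more of the required stretching; the inverse-energy weighting is one illustrative reading of the text, and the function is an assumption.

```python
import numpy as np

def split_concealment(gap_samples, current, future):
    """Divide `gap_samples` of required stretching between the current
    and future buffer content, stretching the lower-energy side more."""
    e_cur = np.mean(np.asarray(current, dtype=float) ** 2) + 1e-12
    e_fut = np.mean(np.asarray(future, dtype=float) ** 2) + 1e-12
    w_cur = (1.0 / e_cur) / (1.0 / e_cur + 1.0 / e_fut)
    from_current = int(round(gap_samples * w_cur))
    return from_current, gap_samples - from_current
```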

The signal compression processes described herein are provided to address the case where the signal buffer has become too full, with a resulting increase in signal delay. Therefore, by compressing the signal contained in the buffer, playback time of the buffered signal is reduced, the buffer is at least partially emptied, and the signal playback delay is reduced. As described herein, when compressing the signal, the signal is examined to identify a segment of the signal wherein compression is expected to produce little or no artifacts, rather than simply compressing the next segment to be played out.

Further, rather than compress all segments of a given frame equally, better results are typically achieved by employing a hierarchical or layered approach to compression. In particular, in an audio signal including speech, each segment of a frame will be either a “voiced”, an “unvoiced”, or a “mixed” segment, as previously described. Given the determination of segment type in the buffer, the desired compression is achieved in any given frame or frames by first compressing particular segment types in a preferential hierarchical order.

2.2 System Architecture:

The processes summarized above are illustrated by the general system diagram of FIG. 2. In particular, the system diagram of FIG. 2 illustrates the interrelationships between program modules for implementing an adaptive audio playback controller for providing adaptive buffer dependent jitter control and packet loss concealment for an audio signal received across a packet-based network. It should be noted that any boxes and interconnections between boxes that are represented by broken or dashed lines in FIG. 2 represent alternate embodiments of the adaptive audio playback controller described herein, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.

As illustrated by FIG. 2, a system and method for adaptive buffer dependent jitter control and packet loss concealment begins by receiving a stream of network packets 200 across a packet-based network. These packets 200 are received by a signal input module 210. This signal input module 210 then provides the received packets to a codec module 220 which uses the appropriate conventional decoder to decode the received packets 200 into one or more signal frames. These decoded signal frames are then stored in a signal buffer 230 as soon as they have been decoded. This process for receiving network packets 200 via the signal input module 210, decoding those packets 220, and storing the packets into the signal buffer 230 continues for as long as receipt of network packets 200 continues.

However, the signal buffer 230 does not continue to fill up during this time. In fact, frames are read out of the buffer, on an as-needed basis, as quickly as possible so as to minimize buffer delay. However, rather than simply reading the frames out of the buffer 230 for playback, a buffer analysis module 240 is used to examine the contents of the buffer for the purpose of determining whether to provide unmodified playback from the buffer contents, to compress or stretch buffer content, or to provide packet loss concealment for overly delayed or lost packets. The buffer contents, whether or not modified, are then gradually output for playback on a conventional playback device. Besides standard computers, such playback devices also include wired and wireless telephones, cellular telephones, radio devices, and other packet-based communications systems or devices operable over a packet-based network.

In general, the determination of how to process the frames in the signal buffer 230 is a function of buffer content. For example, where the buffer 230 is full or nearly full, and there are no missing frames, each desired output frame is simply provided directly from the signal buffer 230 to a frame output module 280 for playback on a playback device 290.

In the case where the size of the signal buffer 230 is too small, e.g., because one or more expected packets have not yet been received, but have not yet been declared as lost, one or more frames possibly present in the signal buffer are stretched via a stretching module 260 using a content-type specific stretching process so as to minimize any artifacts that might be perceived by a human listener. This stretching process is described in further detail below in Section 3.3. The stretching then continues for as long as needed until receipt of the next frame for playback, or until the delayed packet is declared to be lost, i.e., a “late loss” packet.

In the case where the signal buffer 230 is too full, i.e., the buffer exceeds a predetermined maximum threshold length, one or more segments of the signal buffer are compressed by a compression module 250. This compression module 250 uses a novel hierarchical frame compression process for temporal compression of one or more signal frames.

A loss concealment module 270 is used to address the case where one or more packets are declared to be a late loss. In this case, packet loss concealment is used to hide or minimize artifacts that will result from either joining non-contiguous segments of the audio signal, or from blending new samples into the existing content of the signal buffer 230 for the purpose of filling any “holes” left in the signal as a result of packet loss or undue delay.

3.0 Operation Overview:

The above-described program modules are employed in the adaptive audio playback controller. As summarized above, this adaptive audio playback controller provides for automatic buffer-based adaptive jitter control and packet loss concealment for audio signals transmitted across a packet-based network as a function of buffered signal content. Further, the de-jittering and packet loss concealment processes described herein are compatible with most conventional codecs for decoding and providing a playback of audio signals. The following sections provide a detailed operational discussion of exemplary methods for implementing the program modules described in Section 2.

In general, the adaptive audio playback controller operates by first using a conventional codec for decoding and reading transmitted signal frames into a signal buffer as soon as all information necessary to decode those frames has been received. Note that for some codecs, this “necessary information” may include previous packets, as long as they have not yet been declared as “losses.” Samples of the decoded audio signal are then played out of the buffer according to the needs of the player device. Note that the size of the input frame read into the buffer and the size of the output frame (i.e., the sample output to the player device) do not need to be the same. Input frame size is determined by the codec, and some codecs use larger frame sizes to save on bitrate. Output frame size is generally determined by the buffering system on the playout or playback device. For example, in a tested embodiment, a 10 ms output frame was used in combination with a 20 ms input frame. However, rather than simply playing back the frames, the adaptive audio playback controller stretches or compresses the signal, as necessary, to perform real-time jitter control and packet loss concealment as a function of buffer content.

Primary components of the de-jittering processes include signal stretching processes, and signal compression processes. These processes operate based on a maximum and minimum buffer size. In a tested embodiment, a 10 ms minimum buffer size was used to guarantee enough data is present in the buffer to allow for good quality stretching. In contrast, the maximum buffer size is designed as a tradeoff between minimizing the probability that any given sample will need to be stretched, and the delay resulting from increased buffer size.

For example, in one embodiment, maximum buffer size was determined by performing a conventional statistical modeling of the broadcast channel or network, and setting the maximum buffer size at a level that will guarantee receipt of at least a minimum threshold number of data packets, such as, for example, 95% of the packets, before those packets are needed for playback. Methods for performing such statistical modeling of packet receipt across a network channel are well known to those skilled in the art, and will not be described in detail herein.

As noted above, one of the components of the adaptive audio playback controller involves a signal stretching process. Conventional signal stretching schemes typically stretch a received frame of the audio signal until the scheduled arrival time for the next packet. However, these schemes will declare a packet as a “late loss” when it is not received within a certain predetermined period of time. For example, such schemes typically set a time limit for receiving a packet n that expires soon after the time a prior packet, i.e., packet n−1, was received. If packet n is not received by that predetermined time, a late loss is declared, and “loss concealment” techniques are then used for concealing that loss. Thus, such schemes are packet-based.

In contrast, the adaptive audio playback controller described herein operates as a function of buffer content rather than packet receipt time. For example, unlike conventional stretching schemes, the audio playback controller begins stretching the contents of the buffer whenever a particular packet, e.g., packet n, arrives later than “scheduled.” In this case, the signal existing in the buffer is stretched until the delayed packet arrives, or until it is eventually declared “lost.”

This process differs from conventional stretching schemes in that rather than immediately declaring a packet as a “late loss” when it is not received within a predetermined period of time, the contents of the buffer, the amount of stretching already performed, and the reception of any subsequent packets are all used to determine an appropriate time for declaring that packet to be a late loss. Consequently, the adaptive audio playback controller provides a significantly increased packet receipt time prior to declaring a late loss for any given packet. As a result, packet “late loss” is significantly reduced, thereby resulting in a significantly reduced use of packet loss concealment processes for reducing artifacts in the signal, and a perceptibly cleaner signal playback.

In particular, rather than setting a time limit for declaring packet loss, the adaptive audio playback controller simply waits for the next packet to be received, or until one of several “loss conditions” is satisfied, as described below. For example, one such loss condition is to set a maximum delay time for packet receipt. Given a sufficiently long delay time T, late loss will only be declared in relatively extreme delay cases, when a signal connection was lost, or when a talk spurt ended in the case where no information is sent about the end of the talk spurt. In a tested embodiment, values for the delay time T on the order of about 20 ms to about 1 sec were used, with values of T around 100 ms typically providing good results.

A second loss condition relates to receiving a subsequent packet prior to receiving the next expected packet in the transmission. Typically, this results from either packet inversion or actual packet loss. As noted above, conventional schemes will generally ignore packet arrival order, and wait the maximum amount of time regardless of whether a subsequent packet has been received or not. Instead, the adaptive audio playback controller reduces the time required to declare a late loss whenever a subsequent packet is received prior to receiving the expected packet. However, to minimize any declarations of “late loss” due to packet inversion, the adaptive audio playback controller waits before declaring a loss, even if a subsequent packet has already been received. Since packet inversions are rare, the waiting is kept to a minimum, in order to avoid introducing additional artifacts in the signal. More specifically, the signal in the buffer will not be stretched beyond the period that the buffered signal would be stretched in the case where a packet loss would be declared, as noted above. Once that time has been reached, packet n is declared as lost, and the packet loss concealment processes described below are used to reduce or eliminate artifacts in the signal.

The processes described below are generally illustrated by FIG. 3. In particular, as illustrated by FIG. 3, when new data 300 is available, it is read and subsequently written 310 to the signal buffer 230. Then, an analysis of the buffer content is made to determine whether the buffer is too low 320. If the contents of the signal buffer 230 are determined to be too low, then the contents of the buffer are stretched 330, as described in detail below. In contrast, if the contents of the buffer 230 are determined not to be too low, then a determination is made as to whether the buffer is too full 340. In the case where the buffer is too full, the contents of the buffer are compressed 350, as described in detail below. Finally, a segment of the buffer, whether unmodified, stretched, or compressed, is then played 360, one output frame at a time, via a conventional playback device. These steps continue to loop, along with the ongoing analysis of the signal buffer content, for the purpose of determining how to best handle incoming packets in a conventional lossy and delay prone packet-based network.
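
The FIG. 3 loop reduces to a few lines of control logic. In the sketch below, the `buffer` object is assumed to expose the operations named in the figure; it is a structural outline of the flow described above, not the patented implementation, and the reference numerals appear in the comments.

```python
def playback_step(buffer, min_ms, max_ms):
    """One iteration of the FIG. 3 loop; returns the next output frame."""
    while buffer.has_new_data():               # 300/310: read and write
        buffer.write_decoded_frame(buffer.next_decoded_frame())

    if buffer.duration_ms() < min_ms:          # 320/330: too low, stretch
        buffer.stretch()
    elif buffer.duration_ms() > max_ms:        # 340/350: too full, compress
        buffer.compress()

    return buffer.read_output_frame()          # 360: play one output frame
```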

In another related embodiment, the stretching and compressing is utilized mostly to compensate for clock drift (i.e., small differences in clock frequency) between encoder and decoder clocks. In this embodiment, threshold buffer sizes (i.e., the buffer is too low, or the buffer is too full) for initiating either stretching or compressing of the buffered signal can be relatively small, typically on the order of about one or two pitch periods.

3.2 Packet Loss Concealment:

As noted above, although late loss of packets is reduced by using an increased delay time T, a loss concealment mode is still implemented to hide such losses when necessary. In particular, once it is determined that a packet is lost, the system will no longer wait for that packet to be received. Loss concealment then takes the form of either a “mute mode,” or of a “loss concealment mode.”

For example, as illustrated by FIG. 4, a lost packet triggers either a “loss concealment mode” 460 or a “mute mode” 430. In particular, if an expected packet, packet n, has been received 400, then there is no packet loss. That packet is then decoded 410 and provided to the signal buffer. However, if the expected current packet, packet n, has not been received 400, then a determination is made as to whether the delay time T has been exceeded 420. If the delay time T has been exceeded 420, then a packet loss is declared and the mute mode is entered 430.

Alternately, if the delay time T has not yet been exceeded 420, then a determination is made as to whether the data in the signal buffer has already been stretched 440. If that data has been stretched 440, then a determination is made as to whether any subsequent packet, e.g., packet n+1 or higher, has already been received 450. If a subsequent packet has been received 450, then a packet loss is declared and the concealment mode is entered 460. However, if either the buffer data has not been stretched 440, or a subsequent packet has not yet been received 450, then the adaptive audio playback controller simply continues waiting for the expected packet, i.e., packet n 470, while looping through the above steps, 400 through 460. Once the packet is either declared not lost 410, or lost (430 or 460), and the appropriate action taken, then the next packet, i.e., packet n+1, becomes the current packet, and the aforementioned steps (400 through 480) repeat.
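
The FIG. 4 decision can be condensed into a single function. The sketch below is a paraphrase of that flow with invented names; the figure's reference numerals appear in the comments.

```python
def late_loss_decision(packet_received, waited_ms, max_delay_ms,
                       buffer_stretched, later_packet_seen):
    """Return 'decode', 'mute', 'conceal', or 'wait' for expected packet n."""
    if packet_received:
        return "decode"                  # 410: no loss, decode packet n
    if waited_ms > max_delay_ms:
        return "mute"                    # 420/430: delay time T exceeded
    if buffer_stretched and later_packet_seen:
        return "conceal"                 # 440/450/460: declare late loss
    return "wait"                        # 470: keep waiting for packet n
```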

3.2.1 Mute Mode and Comfort Noise:

As noted above, the mute mode 430 is entered when no packet is received for a length of time exceeding some pre-determined threshold such as the delay time T. In general, when no packet is received within the delay time T, this non-receipt is interpreted as either the end of a talk spurt or a loss of connection. In either case, the receiver will “mute” the current signal. In one embodiment, this muting is implemented gradually so as to minimize audible artifacts in the signal. In another embodiment, the signal is not entirely muted, but is instead reduced to a “comfort noise” level. Comfort noise is frequently used in conventional communications systems for simulating a noise level similar to any noise that was present when the connection was active, but when there was no speech. Consequently, signal loss is not readily apparent to the listener. This is important for maintaining apparent signal quality in lossy networks where the signal may be lost and reestablished a number of times during a typical communication session.

With respect to the mute mode 430, the adaptive audio playback controller presents a unique process for generating comfort noise by using a running comfort noise buffer containing a number of “silence frames.” In a tested embodiment, using a comfort noise buffer of about three or so silence frames provided good results. In general, whenever a new frame is received, the overall energy E of the frame is computed and compared to the stored energy of the current silence frames in the comfort noise buffer. If the current frame has lower energy than any of the frames already in the comfort noise buffer, then the frame having the highest energy is replaced with the current frame. Further, in addition to storing the energy of the frame, the magnitudes of the FFT coefficients of the frames are also stored for use in synthesizing a “comfort noise frame,” as described below.
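
A sketch of this running silence frame buffer follows. The three-frame capacity matches the tested embodiment; the class name, data layout, and the renewal method (which also covers the time-out mechanism described next) are assumptions.

```python
import numpy as np

class SilenceFrameBuffer:
    """Track the lowest-energy frames seen, with their FFT magnitudes,
    for later comfort noise synthesis."""

    def __init__(self, size=3):
        self.size = size
        self.frames = []        # list of [energy, fft_magnitude] entries

    def offer(self, frame):
        # Compute the frame energy E and keep the frame only if it is
        # quieter than the loudest frame currently stored.
        energy = float(np.sum(np.asarray(frame, dtype=float) ** 2))
        mag = np.abs(np.fft.rfft(frame))
        if len(self.frames) < self.size:
            self.frames.append([energy, mag])
            return
        worst = max(range(self.size), key=lambda i: self.frames[i][0])
        if energy < self.frames[worst][0]:
            self.frames[worst] = [energy, mag]

    def age(self):
        # Periodic renewal (e.g., every 15 seconds): double the stored
        # nominal energies, but not the FFT magnitudes, so an atypically
        # quiet frame is eventually displaced.
        for entry in self.frames:
            entry[0] *= 2.0
```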

In a related embodiment, a periodic renewal of the silence frames in the buffer is forced through use of a time-out mechanism so as to avoid an atypically low energy silence frame remaining in the buffer forever. For example, if a particular frame is in the buffer for over a predetermined time limit, such as, for example, 15 seconds, the nominal energy E_i of the frame is increased (but not the magnitude of the stored FFT coefficients). This will increase the likelihood that the frame will eventually be replaced with a new frame having lower energy. Assuming a 15 second time limit here, E_i is doubled every 15 seconds, and a small amount of the energy of an arbitrary frame, such as the current frame, for example, is added to handle any cases where E_i = 0.
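A minimal sketch of this silence-buffer maintenance, assuming a three-slot buffer, a 20 ms frame period, and numpy, is given below; the class and helper names are illustrative only, and the small 1e-3 fraction of the current frame's energy used in the time-out rule is an assumed value.

    import numpy as np

    N_SILENCE = 3            # "about three or so" silence frames
    TIME_LIMIT_S = 15.0      # renewal time limit from the text

    class SilenceBuffer:
        """Running comfort-noise buffer: keeps the quietest frames seen."""
        def __init__(self):
            self.energies = []     # stored energy E_i per slot
            self.magnitudes = []   # stored |FFT| per slot
            self.ages = []         # seconds since each slot was stored

        def update(self, frame, frame_period_s=0.02):
            e = float(np.sum(np.asarray(frame, dtype=float) ** 2))  # frame energy E
            mag = np.abs(np.fft.rfft(frame))
            if len(self.energies) < N_SILENCE:
                self.energies.append(e)
                self.magnitudes.append(mag)
                self.ages.append(0.0)
            else:
                worst = int(np.argmax(self.energies))   # highest-energy stored frame
                if e < self.energies[worst]:            # replace it if quieter
                    self.energies[worst] = e
                    self.magnitudes[worst] = mag
                    self.ages[worst] = 0.0
            # time-out renewal: double E_i every TIME_LIMIT_S seconds; a small
            # amount of the current frame's energy covers the E_i == 0 case
            for i in range(len(self.energies)):
                self.ages[i] += frame_period_s
                if self.ages[i] >= TIME_LIMIT_S:
                    self.energies[i] = 2.0 * self.energies[i] + 1e-3 * e
                    self.ages[i] = 0.0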

When a comfort noise frame is needed, the buffered silence frames are then used to generate one. In particular, the average magnitude of the stored silence frames is computed, and a random phase shift is added to the FFT prior to computing the inverse FFT. This signal is then overlapped/added to the signal in the buffer using a conventional window, such as, for example, a sine window. In particular, comfort noise of any desired length is created by computing the Fourier transform of the average magnitude of the silence frames, introducing a random rotation of the phase into the FFT coefficients, and then simply computing the inverse FFT for each segment to create the comfort noise frame. This produces a signal frame having the same spectrum, but no correlation with the original frames, thereby avoiding perceptible artifacts in the signal. In addition, longer signals can be obtained by zero-padding the signal before computing the FFT. These synthesized comfort noise frames are then inserted into the signal playback by using a windowing function to smooth the transition points between the original and subsequent signal frames.
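The synthesis step itself is compact. The following sketch, again assuming numpy and reusing the magnitudes stored by the buffer above, is one plausible reading of it.

    import numpy as np

    def synthesize_comfort_frame(magnitudes, rng=None):
        """Average the stored |FFT| values, attach a random phase, and invert."""
        rng = rng or np.random.default_rng()
        avg_mag = np.mean(np.stack(magnitudes), axis=0)
        phase = rng.uniform(0.0, 2.0 * np.pi, size=avg_mag.shape)
        phase[0] = 0.0          # keep the DC bin real
        phase[-1] = 0.0         # keep the Nyquist bin real (even-length frames)
        spectrum = avg_mag * np.exp(1j * phase)
        return np.fft.irfft(spectrum)   # same spectrum, no correlation with originals

    def sine_window(n):
        """Conventional sine window for overlap/adding the comfort frame."""
        return np.sin(np.pi * (np.arange(n) + 0.5) / n)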

3.2.2 Loss Concealment Mode:

As noted above, the loss concealment mode 460 is entered whenever a subsequent frame is received, one or more intermediate frames are missing, and the data in the signal buffer has already been stretched. Further, loss concealment is either "generic" or specific to whatever codec is being used to decode the incoming packets. For example, many codecs already provide loss concealment algorithms specified as part of the codec. In such a case, the packet loss concealment may use the existing processes of the codec. In other cases, the prescribed loss concealment for a particular codec may not exist or may be sub-optimal. This is often the case, since most loss concealment algorithms have been designed for constant-frame-size environments, and have the constraint of preserving a fixed output frame length. However, when using the techniques described herein, the output frame size is not constrained by the input frame size; therefore, this particular constraint is irrelevant with respect to the adaptive audio playback controller. Further, in either case, the determination of when such loss concealment is to be applied, even with the existing codecs, differs from conventional loss concealment methods by use of the aforementioned signal buffer analysis for deciding whether frames are to be stretched or compressed.

For example, a loss concealment mode designed for G.711 (PCM) coded speech, but which is also appropriate for use with many other conventional codecs, is illustrated by FIG. 5. Note that this loss concealment mode provides an improvement over the standard G.711 loss concealment algorithm, published as Appendix I to the ITU-T Recommendation G.711. As described above with respect to FIG. 4, the concealment mode 460 will only be entered when at least one subsequent frame has been received. For that reason, in addition to any signal frames still remaining in the signal buffer, i.e., the "current buffer content," the signal buffer also contains some non-contiguous future segment of the input signal, i.e., the "future buffer content." The lost segment corresponds to any missing or non-received samples existing between the current buffer content and the future buffer content.

As illustrated in FIG. 5, the first step in loss concealment is to determine the number of signal samples 500 that need to be inserted between the current buffer content and the future buffer content. In the simplest case, the number of samples is simply set equal to the number of samples corresponding to the lost frame or frames represented by the lost packet. However, in another embodiment, a slightly more elaborate computation is used to determine the number of samples needed. In particular, as described above, some stretching of the signal buffer content will have already occurred prior to packet loss concealment.

Consequently, a better estimate of the number of samples needed is determined by first subtracting the number of samples resulting from that stretching from the number of lost samples. Further, to allow enough data for windowing (i.e., overlapping/adding) the transition between the inserted samples and the current and future buffer content, samples representing at least an additional half-window are added to the total number of samples to be inserted. Further, in one embodiment, additional samples are inserted to allow the alignment between the two segments to be done in both directions.

Note that if too many frames are lost, any transition will likely sound unnatural. Consequently, in a related embodiment, to address this case, and to further reduce any resulting artifacts, the number of frames to be replaced is limited to two frames. However, it should be noted that, if necessary to preserve overall signal length, the signal may later be further stretched at some other point in the data existing in the signal buffer.

The next step is to compute a desired or target size for the future buffer content 510. The simplest method is to set the target size of the future buffer content equal to the current size of the future buffer content, plus half the sum of the number of samples to be inserted and the overlap/add window size. In other words, as illustrated by Equation 1:

DF = LF + (K + OV)/2  Equation 1

where DF is the target size for the future buffer content, LF is the actual current size of the future buffer content, K is the number of target samples to insert, and OV is the overlap/add window size (i.e., the size of the sine window or other window used for the overlap/add operation).

Another method for computing the desired size for the future buffer content is described below in Section 3.2.2.1 with respect to FIG. 6. However, in any case, once the target size of the future buffer content is computed, the future buffer is stretched from its current size to approximately the target size. Any conventional stretching method may be used to complete the stretching operation. A novel stretching method is described in a copending United States utility patent application entitled "A SYSTEM AND METHOD FOR PROVIDING HIGH-QUALITY STRETCHING AND COMPRESSION OF A DIGITAL AUDIO SIGNAL," filed Sep. 10, 2003, and assigned Ser. No. 10/660,325, the subject matter of which is hereby incorporated herein by this reference.

In general, as described in the aforementioned copending patent application entitled "A SYSTEM AND METHOD FOR PROVIDING HIGH-QUALITY STRETCHING AND COMPRESSION OF A DIGITAL AUDIO SIGNAL," this novel stretching method provides an adaptive "temporal audio scalar" for automatically stretching and compressing frames of audio signals received across a packet-based network. Prior to stretching or compressing segments of a current frame, the temporal audio scalar first computes a pitch period for each frame for sizing signal templates used for matching operations in stretching and compressing segments.

Further, the temporal audio scalar also determines the type or types of segments comprising each frame. These segment types include "voiced" segments, "unvoiced" segments, and "mixed" segments which include both periodic and aperiodic components. The stretching or compression methods applied to segments of each frame are then dependent upon the type of segments comprising each frame. Further, the amount of stretching and compression applied to particular segments is automatically variable for minimizing signal artifacts while still ensuring that an overall target stretching or compression ratio is maintained for each frame.

Since the stretching process may produce slightly more (or fewer) samples than desired, depending upon the content of the frame being stretched, the necessary length of the current buffer content is estimated as the desired total length, plus the required overlap (for the overlap/add process), minus the actual size of the future buffer content after stretching. In other words, as illustrated by Equation 2:

DC = T + OV − AF  Equation 2

where DC is the desired or target size for the current buffer content, T is the target total size of the signal buffer after concealment, OV is the overlap/add window size, as noted above, and AF is the actual size of the future buffer content after stretching. Given the target size of the current buffer content, the current buffer is then stretched by the necessary amount to achieve that target size. Finally, a variable content-based overlap/add windowing process is applied to mix or fade the current and future buffer content into a continuous segment of the input signal. Note that this overlap/add process is described generally below with respect to FIG. 7, and in more specific detail in the aforementioned copending United States utility patent application entitled "A SYSTEM AND METHOD FOR PROVIDING HIGH-QUALITY STRETCHING AND COMPRESSION OF A DIGITAL AUDIO SIGNAL."
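Taken together, Equations 1 and 2 amount to a few lines of arithmetic. The following sketch shows them as Python functions, with all quantities in samples; the integer division in Equation 1 is an assumption made here to keep sample counts whole.

    def target_future_size(LF, K, OV):
        """Equation 1: DF = LF + (K + OV) / 2."""
        return LF + (K + OV) // 2

    def target_current_size(T, OV, AF):
        """Equation 2: DC = T + OV - AF, where AF is the actual size of the
        future buffer content after it has been stretched."""
        return T + OV - AF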

3.2.2.1 Computing the Target Size for Buffer Content:

As noted above, a simple solution for determining the target size for stretching the future and current buffer content is given by Equation 1 and Equation 2, respectively. However, an alternate solution is illustrated by FIG. 6. In general, as illustrated by FIG. 6, the generic approach of Equation 1 and Equation 2 is modified to decide how much to stretch the future and current buffer content so as to minimize perceivable artifacts in the stretched signal.

As illustrated by FIG. 6, the first step is to determine whether the buffer content is sufficiently long to allow for stretching without noticeable artifacts. In particular, the future buffer content is first examined to determine whether it is below a minimum size 600 to allow for high quality stretching. For example, in one embodiment, a minimum size of about two "pitch" periods may be used as an indicator of whether the content can be stretched without creating undesirable artifacts. In another tested embodiment, a minimum size of 280 samples was used.

When the future buffer content is below the minimum size 600, then a ratio of the average energy of the future buffer content to the average energy of the current buffer content is set to zero 615, thus resulting in stretching of only the current buffer content (see the discussion of box 645 below). Note that the current buffer does not need to be tested for minimum size, first because it is always kept larger than the minimum size, and second because, in any case, one of the two (current or future buffer content) has to be stretched to cover for the missing (lost) segment.

The reason for limiting stretching of signals less than two pitch periods is that stretching a voiced segment without having at least two pitch periods will generally introduce undesirable artifacts into the signal. As is well known to those skilled in the art, voiced sounds such as speech are often modeled using quasi-periodic pulses whose rate is typically referred to as the fundamental frequency or "pitch." However, as the concepts of pitch and pitch period are well known to those skilled in the art, the determination of pitch and pitch period will not be described herein.

In stretching the content of the signal buffer, stretching is preferably divided between the current and future frames as a function of the energy of each frame so as to minimize signal artifacts resulting from the stretching. In general, the amount of stretching of the future and current buffer content is done in inverse proportion to the energy of that content. The reason for this approach is that, in general, stretching a low energy signal close to a high energy signal tends to mask audible artifacts. Thus, for example, if the future buffer content includes 80 percent of the total energy, and the current buffer content includes 20 percent of the energy, then the future buffer content will be stretched by 20 percent and the current buffer content will be stretched by 80 percent of the extra samples needed.

When the future buffer content is not below the minimum size 600, the average energy of both the current buffer content and the future buffer content is computed 620. These average energies are then used to compute a ratio of the average energies 625. In one embodiment, this ratio is then used to compute the desired size of the future and current buffer content 645 as a function of the ratio, the existing buffer size, the number of target samples needed, and the overlap/add window size. For example, the target size for the future buffer content may be computed as illustrated by Equation 3:

DF = LF + (K + OV)·R, where R = 1/RATIO if RATIO ≠ 0, and R = 0 otherwise  Equation 3

where DF is the desired or target size for the future buffer content, LF is the existing size of the future buffer content, K is the total number of target samples to insert, OV is the overlap/add window size, and RATIO is the ratio of the average energies computed above.

Similarly, the target size for the current buffer content could be computed using an equation similar to Equation 3. Nevertheless, a more appropriate solution is to use Equation 2, which will give the same result if the actual stretching of the future buffer content happens exactly as requested, but will also incorporate any small differences between the target and actual size of the future buffer after stretching.

However, rather than blindly applying Equations 2 and 3 to determine the target size for the future and current buffer content, better results are achieved by first examining the computed ratio 625 to determine whether the future or current buffer content should actually be stretched.

In particular, in one embodiment, if the computed ratio 625 is less than a predetermined minimum threshold 630, then the ratio is set to zero 615 so that the future buffer content will not be stretched at all, because the relative energy of the future buffer content is so large compared to the current buffer content that stretching of the future buffer content would likely result in noticeable artifacts. Similarly, if the computed ratio 625 is greater than the predetermined minimum threshold 630, then a determination is made as to whether the computed ratio exceeds a predetermined maximum threshold 635.

If the predetermined maximum threshold 635 is exceeded, then the ratio is set to one 640 so that the current buffer content will not be stretched at all, because the relative energy of the current buffer content is so large compared to the future buffer content that stretching of the current buffer content would likely result in noticeable artifacts. In one embodiment, stretching is distributed between the current and future buffer content before taking into account the stretching already performed in the current buffer content (as a result of waiting for a particular frame, as described above), at which point the minimum and maximum thresholds are applied as described above.

Next, whether the ratio is computed 625, or set to zero 615 or one 640 as a function of the minimum and maximum ratio thresholds, the desired or target buffer sizes are then computed as described above with respect to Equations 2 and 3. Finally, the future and current buffer contents are stretched 650 (or not stretched, if appropriate) by inserting the appropriate number of samples into each buffer to meet the target size.
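One possible rendering of this target-size logic (steps 600 through 645) is sketched below, assuming numpy. The 280-sample minimum comes from the tested embodiment above; the threshold values RATIO_MIN and RATIO_MAX are hypothetical placeholders, since the text does not give numeric values for them.

    import numpy as np

    MIN_FUTURE_SIZE = 280                 # tested minimum size, in samples
    RATIO_MIN, RATIO_MAX = 0.25, 4.0      # hypothetical threshold values

    def future_target_size(current, future, K, OV):
        if len(future) < MIN_FUTURE_SIZE:                      # step 600
            ratio = 0.0                                        # step 615: stretch current only
        else:
            e_cur = np.mean(np.asarray(current, float) ** 2)   # step 620
            e_fut = np.mean(np.asarray(future, float) ** 2)
            ratio = e_fut / max(e_cur, 1e-12)                  # step 625
            if ratio < RATIO_MIN:                              # steps 630/615: pin to zero,
                ratio = 0.0                                    #   future left untouched
            elif ratio > RATIO_MAX:                            # steps 635/640: pin to one,
                ratio = 1.0                                    #   current left untouched
        R = (1.0 / ratio) if ratio != 0.0 else 0.0             # piecewise R of Equation 3
        return len(future) + int((K + OV) * R)                 # DF; DC then follows Equation 2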

3.2.2.2 Overlap/Add of Stretched Buffer Frames:

As noted above, once samples from one or both buffers have been stretched enough to cover the lost segment of the signal, it is necessary to window the samples for easing the transition points between the original content of the current buffer and the contents of the future buffer. The aforementioned overlap/add process is used for this purpose. This overlap/add process differs from conventional overlap/add procedures in that it is dependent upon the content type of the signal in the buffers.

For example, in an audio signal including speech, each segment of any particular frame will be either a "voiced," an "unvoiced," or a "mixed" segment, as described above. Then, in order to achieve optimal results, an overlap/add process that is specifically targeted to the particular mix of segment types is applied.

In general, in contrast to conventional windowing schemes, different windows are used for each frame type mix (e.g., voiced/voiced, voiced/unvoiced, etc.). Also, the alignment strategy is different for different frame type mixes. For example, the frames are aligned only in the case where neither frame type is unvoiced. This alignment will match the pitch period of the current buffer with that of the future buffer before the overlap/add is performed. In particular, a "template" is first selected from the current buffer content of the same length as the overlap window. The future buffer content is then examined to identify a matching segment. One method for identifying such matches is to simply compute the cross correlation of the template with the beginning of the future buffer content. The largest peak in the cross correlation then represents the best match. The future buffer content is then shifted by the offset, discarding any samples between the start of the future buffer content and the best match. Then, because the two signal segments are correlated via the alignment, a sum-one overlap/add window is used to smooth the transitions between the current and future buffer content. An example of such a sum-one window is a Hann window.

In the case where at least one of the frame types is unvoiced, there is theoretically no correlation between samples. Consequently, there is no need to perform the alignment as with the voiced samples. Therefore, a square-sum-one window is used by the overlap/add process for smoothing the transition points. An example of such a square-sum-one window is a sine window.
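The following sketch, assuming numpy, illustrates both cases: alignment by cross correlation followed by a sum-one (Hann-based) crossfade for voiced/voiced mixes, and an unaligned square-sum-one (sine/cosine) crossfade otherwise. The search range is an assumed parameter.

    import numpy as np

    def overlap_add(current, future, OV, both_voiced, search=200):
        """Join current and future buffer content over an OV-sample overlap."""
        t = (np.arange(OV) + 0.5) / OV
        if both_voiced:
            # align: cross-correlate a template from the current tail against
            # the start of the future content; shift future to the best match
            template = current[-OV:]
            lags = range(min(search, len(future) - OV))
            best = max(lags, key=lambda k: float(np.dot(template, future[k:k + OV])))
            future = future[best:]                      # discard samples before the match
            fade_in = 0.5 * (1.0 - np.cos(np.pi * t))   # Hann-based pair, sums to one
            fade_out = 1.0 - fade_in
        else:
            fade_in = np.sin(0.5 * np.pi * t)           # sine/cosine pair: squares sum to one
            fade_out = np.cos(0.5 * np.pi * t)
        mix = fade_out * current[-OV:] + fade_in * future[:OV]
        return np.concatenate([current[:-OV], mix, future[OV:]])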

Note that specific details of this frame-type dependent overlap/add process are provided in Section 3.2 of the aforementioned copending patent application entitled "A SYSTEM AND METHOD FOR PROVIDING HIGH-QUALITY STRETCHING AND COMPRESSION OF A DIGITAL AUDIO SIGNAL."

3.3 Codec-Specific Loss Concealment Modes:

The loss concealment procedures described above ignore any distortions or artifacts produced by interdependencies between frames. This is not a problem with signals which have been coded using codecs such as G.711 (PCM). However, when speech has been encoded by some other codec, the loss of a frame will typically induce some mismatch between the internal state of the decoder when compared to the state assumed by the encoder. Consequently, more noticeable artifacts may result from the stretching described above, which may stretch and thus reinforce segments which were not perfectly decoded. The procedures described above are still applicable to such cases, but will most likely yield sub-optimum results.

However, the methods for stretching signals to conceal lost frames described above may be modified to address particular codecs, and in particular to address frame interdependencies resulting from the particular codec used to encode the audio signal. In particular, one may take note of the expected quality of certain segments following a loss, and take that into account when deciding whether or not to stretch that particular segment. For example, the conventional "Siren Codec" (ITU-T G.722.1 codec), currently used in Windows Messenger™, is based on the well known Modulated Lapped Transform (MLT). The only state information is 320 partial samples that overlap between adjacent frames. In this case, this known partial information is used to produce results which are audibly superior to those produced by the standard Siren Codec error concealment.

3.3.1 Basic Modification to the Stretching Process:

The simplest approach to modifying the stretching techniques described above in Section 3.2 is to ignore any incomplete segments of a Siren-coded signal. In particular, Siren frames are 20 ms (320 samples) each, but each Siren frame contains coefficients corresponding to a 640 point MLT. Subsequent frames are then overlapped by 320 samples and added. Therefore, if a single frame is missing, a total of 40 ms of speech will be incomplete. In one embodiment, the entire 40 ms is declared as lost, and the concealment processes described above are applied to conceal that loss. However, this basic approach throws away useful information contained in the partial segments surrounding the lost frame.

3.3.2 Using Interdependency Information in the Stretching Process:

In another embodiment, rather than ignoring the partial information in the surrounding frames, that information is used to create samples for extending the contents of the buffer. In this embodiment, the way the MLT is constructed is used advantageously to partially reconstruct as many "lost" samples as possible. For example, because of the way in which the MLT is computed, the leading and trailing half of each surrounding segment is increasingly dominated by the signal that is to be estimated for loss concealment, with the samples increasing in accuracy towards the ends closest to the missing frame. Specifically, as is known to those skilled in the art of MLT computations with respect to the G.722.1 codec:

"The MLT can be decomposed into a window overlap and add operation, followed by a type IV Discrete Cosine Transform (DCT). The window, overlap and add operation is given by:

v(n) = w(159−n)x(159−n) + w(160+n)x(160+n), for 0 ≤ n ≤ 159

v(n+160) = w(319−n)x(320+n) − w(n)x(639−n), for 0 ≤ n ≤ 159

where:

w(n) = sin((π/640)(n+0.5)), for 0 ≤ n < 320"

Consequently, if, at the decoder side, the inverse DCT is performed but the overlap/add operation is not, the signal v[0:319] as defined above will be recovered. Further, note that v[0:159] is increasingly dominated by x[160:319]. For example, v[159] = 0.0025x[0] + 0.999997x[319]. Consequently, it should be clear that v[159] can be used as an approximation for x[319]. Obviously, the further from the center of v, the worse the approximation is. In addition, it should also be noted that since time reversing a signal does not affect its spectrum, the spectrum of the reversed part of x should be similar to that of the original x.
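The quoted weights are easy to verify numerically from the window definition above; at n = 159 the two window factors are w(0) and w(319):

    import numpy as np

    # quick check of the weights in v[159] = w(0)x[0] + w(319)x[319]
    w = lambda n: np.sin((np.pi / 640.0) * (n + 0.5))
    print(w(0))     # ~0.00245   -> x[0] contributes almost nothing
    print(w(319))   # ~0.999997  -> v[159] is essentially x[319]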

Further, adding two uncorrelated signals is equivalent to adding their spectra. But, as the extremities of v are approached from either side, the two samples of x are increasingly close to each other, and therefore more correlated. For this reason, rather than use all of the samples in v, the last few samples are eliminated, and the remaining samples are used to estimate at least some of the missing samples before replacing any remaining lost samples using the stretching methods described above. The last 5 to 30 samples on each side of v may be discarded.

For example, assuming that the last 20 samples on each side of v are discarded, then only the center 280 samples are used. Therefore, if a single frame is lost, instead of discarding the whole 640 incomplete samples as described in Section 3.3.1, the partial information is used as a way of estimating some of these samples. In a tested embodiment, 280 samples were used to estimate the corresponding 140 samples closest to each extremity of the missing samples, so that only the loss of the center 360 samples actually needs to be concealed using the stretching processes described above. Further, because the estimated samples are not as good as true samples, in one embodiment they are not used to stretch the signal. Consequently, signal stretching is preferably restricted to samples which were completely decoded, rather than those samples that were estimated as described above.
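A minimal sketch of this estimation step is given below. Here v is the 320-sample result of running the inverse DCT for a frame adjacent to the gap while skipping the overlap/add, and the 20-sample discard follows the example above; treating v directly as the estimate of x (rather than dividing out the window) is a simplification consistent with the v[159] ≈ x[319] observation, and which half of v falls inside the missing region depends on which side of the gap the frame sits.

    import numpy as np

    DISCARD = 20   # drop the least reliable samples at each end of v

    def estimate_partial_samples(v):
        """Estimate samples of x adjacent to the missing region from v."""
        v = np.asarray(v, dtype=float)
        # v[0:160] is increasingly dominated by x[160:320]: its tail
        # approximates the 140 samples just before that frame boundary
        before_boundary = v[DISCARD:160]      # ~ x[160 + DISCARD : 320]
        # v[160:320] is dominated near its start by x[320:480]: its head
        # approximates the 140 samples just after that boundary
        after_boundary = v[160:320 - DISCARD] # ~ x[320 : 480 - DISCARD]
        return before_boundary, after_boundary   # 140 samples on each side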

3.4. Selective Signal Compression:

Due to the need to keep up with the real-time nature of the communication, the stretching processes described above are done immediately, whenever a frame is not received in time. Consequently, there is very little choice in whether a particular segment must be stretched (although there is some choice as to where a particular segment is to be stretched, as described in the aforementioned copending patent application entitled "A SYSTEM AND METHOD FOR PROVIDING HIGH-QUALITY STRETCHING AND COMPRESSION OF A DIGITAL AUDIO SIGNAL"). However, there is significantly greater flexibility in compressing the signal for reducing signal delay when the signal buffer becomes too full.

For example, when compressing the signal, it is typically a good idea to wait for a segment of speech where compression is expected to produce few or no artifacts, rather than simply compress the next segment to be played out. One simple solution is to compress only in between talk spurts. However, a better process considers how much compression is desired (i.e., how far behind in time signal playback is), and how easy it is to compress a particular segment while minimizing artifacts. Further, it is noted that the need to compress implies that a long segment of the signal is in the buffer, and therefore there is some freedom as to where to compress that signal.

The selection of which segments to actually compress in any given frame or frames is an important decision, as it typically affects the perceived quality of the reconstructed signal for a human listener. For example, rather than compress all segments of the signal buffer equally, better results are typically achieved by employing a hierarchical or layered approach to compression. In particular, as noted above, the type of each segment is often already known by the time that compression is to be applied to a frame. Given this information, the desired compression is achieved in any given frame or frames by first compressing particular segment types in a preferential hierarchical order.

In particular, voiced segments and silence segments (i.e., segments that include relatively low energy aperiodic signals) are compressed first. Next, unvoiced segments are compressed. Finally, mixed segments, or segments including transients, are compressed. The reason for this preferential order is that compression of voiced or silence segments is easiest to accomplish without the creation of noticeable artifacts. Unvoiced segments are the next easiest type to compress without noticeable artifacts. Finally, mixed segments and segments containing transients are compressed last, as such segments are the hardest to compress without noticeable artifacts.

Consequently, rather than compressing all segments equally in any particular frame or frames, better results are typically achieved by selectively compressing particular segments in those frames, or in particular frames. For example, compressing segments that represent voiced speech, silence, or simple noise, while avoiding compression of unvoiced segments or transients, produces a reconstructed signal having reduced perceivable artifacts. If sufficient compression cannot be accomplished by compressing voiced or silence segments, then non-transitional unvoiced segments are compressed in the manner described above.

Finally, segments including transitions are compressed if sufficient compression cannot be achieved through compression of the voiced segments or non-transitional unvoiced segments. This hierarchical approach to compression serves to limit perceivable artifacts in the reconstructed signal. Further, if sufficient unplayed frames are available, then the desired compression can be spread out over several frames, as necessary, by compressing only those segments that will result in the least amount of signal distortion or artifacts.

In general, once the particular segments to be compressed have been selected or identified, compression of segments is handled in a manner similar to that described above for stretching of segments. For example, when compressing a voiced segment, a template is selected from within the segment, and a search for a match is performed. Once the match is identified, the segments are windowed, overlapped and added, thus cutting out the signal between the template and the match. As a result, the segment is shortened, or compressed. On the other hand, when compressing an unvoiced segment, either a random or predetermined shift is used to delete a portion of the segment or frame, along with a windowing function such as a constant square-sum window to compress the segment to the desired amount. Finally, mixed segments are compressed using a weighted combination of the voiced and unvoiced methods as described in the aforementioned copending patent application entitled "A SYSTEM AND METHOD FOR PROVIDING HIGH-QUALITY STRETCHING AND COMPRESSION OF A DIGITAL AUDIO SIGNAL."
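The hierarchical selection policy itself is straightforward to express. The sketch below assumes hypothetical segment objects carrying a kind label and a compress(max_cut) method that implements the per-type mechanics described above, and removes samples in the stated order of preference until the desired compression is reached.

    # preferential order: voiced/silence first, then unvoiced, then mixed/transient
    PRIORITY = {"voiced": 0, "silence": 0, "unvoiced": 1, "mixed": 2, "transient": 2}

    def compress_buffer(segments, samples_to_remove):
        """Remove up to samples_to_remove samples, easiest segments first."""
        removed = 0
        for seg in sorted(segments, key=lambda s: PRIORITY[s.kind]):
            if removed >= samples_to_remove:
                break
            removed += seg.compress(samples_to_remove - removed)
        return removed   # may fall short; remaining compression waits for later frames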

3.5. Processing in the LPC Residual Domain:

In the preceding discussion, the adaptive audio playback controller accomplished adaptive compression and stretching of signal segments for providing jitter control and packet loss concealment by acting on the signal in the time domain. However, a signal can always be decomposed into a spectral envelope, or Linear Predictive Coding (LPC) spectrum, that represents a frame-level spectrum, and an LPC residue that represents short-time information such as small details in the signal spectrum. Consequently, in one embodiment, the processes described above with respect to stretching, compression, loss concealment and muting are implemented in the LPC residual domain.

In general, processing in the LPC residual domain has two main advantages over operating in the original signal domain. First, operating in the LPC residual domain produces fewer artifacts, because a match is guaranteed in the spectral domain, and the spectrum will evolve much more smoothly. Second, operating in the LPC residual domain reduces delay and may reduce computational overhead because much shorter windows may be used. In fact, because of the close match, the overlap window can simply be omitted altogether, thereby reducing any algorithmic delay to the time required to process a very small number of samples (e.g., the 16 samples used in the LPC filter). However, even if a window is used here, a window with just a few samples will provide good results with reduced signal artifacts.

In this embodiment, an LPC filter is estimated from the contents of the signal buffer at a regular interval, such as, for example, about every 5 ms. The received signal is then passed through the estimated LPC filter in order to obtain an LPC residual. Then, the processes described above are performed on the LPC residual signal rather than on the original time domain signal. Tags for the location of each original point for the LPC filters are kept; then, before playing out, the signal is simply inverse filtered through an interpolated LPC filter. This LPC filter is obtained by interpolating the original LPC filter between corresponding points, as illustrated by FIG. 7.
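The round trip into and out of the residual domain can be sketched as follows, assuming numpy and scipy. The order-16 filter follows the remark above, while the autocorrelation/Levinson-Durbin estimator is a standard choice rather than anything prescribed by the text.

    import numpy as np
    from scipy.signal import lfilter

    def lpc_coeffs(x, order=16):
        """Estimate LPC coefficients a = [1, a1, ..., a_order] by the
        autocorrelation method with a Levinson-Durbin recursion."""
        x = np.asarray(x, dtype=float)
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0] + 1e-12
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] += k * a[i - 1::-1]
            err *= (1.0 - k * k)
        return a

    def to_residual(x, a):
        """Analysis: residual = A(z) applied to the signal."""
        return lfilter(a, [1.0], x)

    def from_residual(res, a):
        """Synthesis: inverse filter 1/A(z); with an unmodified residual
        this reproduces the original signal."""
        return lfilter([1.0], a, res)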

As illustrated by FIG. 7, the first step in using the LPC residual rather than the time domain version of the signal is to get a new frame of data 700 by decoding received network packets 200 transmitted across a conventional packet-based network such as the Internet or other packet-based communications network. Once decoded, the frame is immediately sent to the signal buffer 230. At this point, rather than performing an analysis of the signal buffer as in the time-domain case, an LPC filter is computed or estimated 705 for the received frame using conventional LPC computation techniques. In one embodiment, a single LPC filter is used for each frame. However, in a related embodiment, a new filter is estimated and used over relatively short periods, such as, for example, about every 5 ms.

Next, the LPC residual is computed 710 using the estimated LPC filter. However, in another embodiment, better results may be achieved by interpolating between the estimated filters and then using a series of estimated and interpolated LPC filters for computing the LPC residual from the received frame. The computed LPC residual is provided to an LPC residual signal buffer 720, which is basically the LPC residual version of the signal buffer 230. The LPC residual signal buffer 720 is then treated in the same manner as the signal buffer 230 for the purpose of determining whether to stretch, compress, conceal losses, or mute the signal 725, as described above. Further, stretching, compressing, and loss concealment 725 are accomplished exactly as described above with respect to the time domain signal, except that in the LPC residual domain there is no need for a long overlap window. In particular, rather than using a long window for overlap/add operations, a sharp transition, or a simple 3-point window, provides satisfactory results.

As with the time domain case, a determination is then made as to whether a signal frame is needed for playback 730 by the output device 290. Then, using pointers to the current location of the original LPC filters, interpolated LPC filters are generated 735. These interpolated LPC filters are then used for performing an inverse LPC filtering 740 of the potentially modified (stretched, compressed, loss-concealed, or muted) LPC residual. Note that the unmodified LPC residual is simply the LPC-filtered input frame; therefore, if no processing (stretching, compression, or loss concealment) has been done, then inverse filtering will reproduce the original input frame here. The resulting synthesized or original signal frame is then output 745 and sent to the playback device 290.

The steps described above continue looping, 730 through 745, and back from 730 to 700, until the end of the input signal has been reached and there is no more data to provide to the playback device 290.

The foregoing description of the adaptive audio playback controller for performing automatic buffer-based adaptive jitter control and packet loss concealment for audio signals transmitted across a packet-based network as a function of buffer content has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. Further, it should be noted that any or all of the aforementioned alternate embodiments may be used in any combination desired to form additional hybrid embodiments of the temporal audio scalar described herein. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

1-21. (canceled)
22. A method for adaptive playback of received frames of an audio signal transmitted across a packet-based network, comprising using a computing device to: receive a packetized audio signal broadcast across a packet-based network; decode each received packet and store the resulting decoded signal frame in a signal buffer; output a current packet in the case where the current packet has been received across the packet-based network; instantiate a mute mode whereby a playback of the audio signal is at least partially muted when a maximum delay time for receiving the current packet has been exceeded, and the current packet has not been received; and instantiate a packet loss concealment mode whereby the playback of the audio signal is modified for reducing audible artifacts resulting from one or more lost packets when a current buffer content has been previously temporally stretched, the current packet has not yet been received, and a packet subsequent to the current packet has already been received.
23. The method of claim 22 further comprising analyzing the content of the signal buffer for determining a current length of the contents of the signal buffer.
24. The method of claim 23 further comprising stretching and outputting one or more decoded frames from the signal buffer when the current length of the contents of the signal buffer is less than a predetermined minimum buffer size.
25. The method of claim 24 wherein the predetermined minimum buffer size is optimized to compensate for clock drift between an encoder and a decoder.
26. The method of claim 23 further comprising compressing and outputting one or more decoded frames from the signal buffer when the current length of the contents of the signal buffer is greater than a predetermined maximum buffer size.
27. The method of claim 26 wherein the predetermined maximum buffer size is optimized to compensate for clock drift between an encoder and a decoder.
28. The method of claim 22 wherein modification of the playback of the audio signal in the packet loss concealment mode comprises: computing an average energy for a frame in the signal buffer immediately preceding the current packet that has not yet been received; computing an average energy for a frame in the signal buffer immediately succeeding the current packet that has not yet been received; and determining a target frame size for both the preceding and succeeding frames as a function of the ratio of the average energy of the succeeding frame to that of the preceding frame.
29. The method of claim 28 wherein determining a target frame size for both the preceding and succeeding frames further comprises stretching the succeeding and preceding frames by an amount that is inversely proportional to the ratio of the average energy.
30. The method of claim 29 wherein instantiating the mute mode comprises generating and providing playback of a comfort noise signal to replace lost packets, said comfort noise signal being generated from at least one signal frame stored in a silence buffer, said signal frame having been determined to represent nominal background noise.
31. The method of claim 30 further comprising periodically replacing the signal frames in the silence buffer as a function of a computed energy of those frames.
32. The method of claim 30 wherein generating the comfort noise signal from the at least one signal frame stored in a silence buffer comprises: automatically computing the FFT of the at least one signal frame stored in the silence buffer; introducing a random rotation of the phase into the FFT coefficients; computing the inverse FFT for each segment, thereby creating at least one synthetic silence segment; and providing the at least one silence segment for playback as the comfort noise signal.
33-50. (canceled)
51. A physical computer-readable media having computer-executable instructions stored thereon for adaptive playback of received frames of an audio signal transmitted across a packet-based network, said computer-executable instructions comprising: receiving a packetized audio signal broadcast across a packet-based network; decoding each received packet and storing the resulting decoded signal frame in a signal buffer; outputting a current packet in the case where the current packet has been received across the packet-based network; instantiating a mute mode whereby a playback of the audio signal is at least partially muted when a maximum delay time for receiving the current packet has been exceeded, and the current packet has not been received; and instantiating a packet loss concealment mode whereby the playback of the audio signal is modified for reducing audible artifacts resulting from one or more lost packets when a current buffer content has been previously temporally stretched, the current packet has not yet been received, and a packet subsequent to the current packet has already been received.
52. The computer-readable media of claim 51 further comprising instructions for analyzing the content of the signal buffer for determining a current length of the contents of the signal buffer.
53. The computer-readable media of claim 51 further comprising instructions for stretching and outputting one or more decoded frames from the signal buffer when the current length of the contents of the signal buffer is less than a predetermined minimum buffer size.
54. The computer-readable media of claim 51 further comprising instructions for compressing and outputting one or more decoded frames from the signal buffer when the current length of the contents of the signal buffer is greater than a predetermined maximum buffer size.
55. A system for providing adaptive playback of received frames of an audio signal transmitted across a packet-based network, comprising using a computing device for: receiving a packetized audio signal broadcast across a packet-based network; decoding each received packet and storing the resulting decoded signal frame in a signal buffer; outputting a current packet in the case where the current packet has been received across the packet-based network; instantiating a mute mode whereby a playback of the audio signal is at least partially muted when a maximum delay time for receiving the current packet has been exceeded, and the current packet has not been received; and instantiating a packet loss concealment mode whereby the playback of the audio signal is modified for reducing audible artifacts resulting from one or more lost packets when a current buffer content has been previously temporally stretched, the current packet has not yet been received, and a packet subsequent to the current packet has already been received.
56. The system of claim 55 further comprising analyzing the content of the signal buffer for determining a current length of the contents of the signal buffer.
57. The system of claim 55 further comprising stretching and outputting one or more decoded frames from the signal buffer when the current length of the contents of the signal buffer is less than a predetermined minimum buffer size.
58. The system of claim 57 wherein the predetermined minimum buffer size is optimized to compensate for clock drift between an encoder and a decoder.
59. The system of claim 55 further comprising compressing and outputting one or more decoded frames from the signal buffer when the current length of the contents of the signal buffer is greater than a predetermined maximum buffer size.